Part 1 - Project Based

DOMAIN: HEALTHCARE


CONTEXT: Medical research university X is conducting in-depth research on patients with certain conditions. The university has an internal AI team. Due to confidentiality, the patients' details and the conditions are masked by the client, who provides separate datasets to the AI team for developing an AIML model that can predict a patient's condition from the received test results.

DATA DESCRIPTION: The data consists of biomechanics features of the patients according to their current conditions. Each patient is represented in the dataset by six biomechanics attributes derived from the shape and orientation of the affected body part.

1. P_incidence
2. P_tilt
3. L_angle
4. S_slope
5. P_radius
6. S_degree
7. Class

PROJECT OBJECTIVE: Demonstrate the ability to fetch, process and leverage data to generate useful predictions by training Supervised Learning algorithms.

1. Import and warehouse data:


• Import all the given datasets and explore shape and size of each.
• Merge all datasets onto one and explore final shape and size.

Import the datasets

Shape of data

Size of dataset

Checking the head and tail of the three separate datasets

Merging the three datasets into a combined dataset
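A minimal pandas sketch of this merge, assuming the three files share the same seven columns (the frames below are toy stand-ins for the masked datasets, which would normally be read with `pd.read_csv`):

```python
import pandas as pd

# Toy stand-ins for the three provided datasets; the real files are masked.
cols = ["P_incidence", "P_tilt", "L_angle", "S_slope", "P_radius", "S_degree", "Class"]
parts = [pd.DataFrame([[63.0, 22.6, 39.6, 40.5, 98.7, -0.25, "Normal"]], columns=cols)
         for _ in range(3)]

# Stack the frames row-wise; ignore_index rebuilds a clean 0..n-1 index.
combined = pd.concat(parts, ignore_index=True)
print(combined.shape)  # (3, 7)
```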

2. Data cleansing:
• Explore and if required correct the datatypes of each attribute
• Explore for null values in the attributes and if required drop or impute values

Information about the datatypes.

The 'Class' column has dtype object; we will have to change that.

As we can see, the 'Class' column has typographical errors: Type_S, Type_H and Normal appear as tp_s, type_h and Nrmal respectively. We will correct that.
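A sketch of the label clean-up with pandas (`df` and the values shown stand in for the real combined dataset):

```python
import pandas as pd

# Toy frame showing only the mislabelled values from the 'Class' column.
df = pd.DataFrame({"Class": ["tp_s", "type_h", "Nrmal", "Type_S"]})

# Map each misspelling onto the intended label, then store the column as categorical.
df["Class"] = df["Class"].replace({"tp_s": "Type_S", "type_h": "Type_H", "Nrmal": "Normal"})
df["Class"] = df["Class"].astype("category")
print(sorted(df["Class"].unique()))  # ['Normal', 'Type_H', 'Type_S']
```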

Checking for null values

There are no null values in the dataset
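The null check can be sketched as follows (toy frame standing in for the combined data):

```python
import pandas as pd

# Toy frame standing in for the combined dataset.
df = pd.DataFrame({"P_incidence": [63.0, 39.1, 68.8],
                   "S_degree": [-0.25, 25.1, 4.6]})

# Per-column count of missing values; every count is 0 here,
# matching what the report finds on the real data.
nulls = df.isnull().sum()
print(nulls.sum())  # 0
```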

3. Data analysis & visualisation:
• Perform detailed statistical analysis on the data.
• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

We can see that:


  • P_incidence: The mean and the median are almost equal and there are no negative values.
  • P_tilt: The mean and the median are almost equal and there are negative values. 75% of values are less than 22 but the maximum value is 49.
  • L_angle: The mean and the median are almost equal and there are no negative values. 75% of values are less than 63 but the maximum value is 125, so the data may have outliers.
  • S_slope: The mean and the median are almost equal and there are no negative values. 75% of values are less than 52 but the maximum value is 121, so the data may have outliers.
  • P_radius: The mean and the median are almost equal and there are no negative values.
  • S_degree: The mean is greater than the median, so the data may be right-skewed, and the standard deviation is high. There are negative values. 75% of values are less than 41 but the maximum value is 418, so there are obvious outliers in the data.
  • Class: There are 3 categories and Type_S has the maximum frequency.
  • Range of the data

    Checking the variance of all Columns.

    Measuring the skewness of every attribute

    As we can see, there is high skewness in the S_Degree column.

    Univariate Analysis

    P_incidence

    Checking count of outliers

    Normality is maintained in the P_incidence column and there are 3 outliers.

    P_tilt

    Checking count of outliers

    We can see that P_tilt is slightly right-skewed and there are both negative and positive outliers.

    L_angle

    Checking count of Outliers

    We can see slight right skewness due to 1 outlier.

    S_slope

    Checking Count of outliers

    P_radius

    Checking for outliers in P_radius.

  • We can see that the data is normally distributed; there are outliers at both ends.

  • S_Degree

    Checking for outliers in S_Degree.

  • The S_Degree column is positively skewed and highly affected by outliers, as we can see from the box plot.

  • Class (Target Variable)

    As we can see, Type_S has the maximum share, i.e. 48.4% of the entire dataset.

    Bi-variate Analysis

    Class and P_incidence: Swarm plot, Box plot and Bar plot between Class and the P_incidence attribute.

  • P_Incidence Value is larger for Type_S Class.
  • We can see some extreme values as well.

  • Normal Value is slightly higher than Type_H

  • Class and P_tilt: Swarm plot, Box plot and Point plot between Class and the P_tilt attribute.

  • The mean of Type_S is slightly higher than the other two.
  • There are outliers for Normal and Type_H class.
  • The Point plot shows that the mean value for Type_S is significantly higher than both Type_H and Normal.

  • Class and L_angle: Swarm plot, Box plot and Point plot between Class and the L_angle attribute.

  • L_angle has higher values for Type_S Class.
  • All the classes contain outliers.
  • L_angle value is low for Type_H Class.

  • Class and S_slope: Swarm plot, Box plot and Point plot between Class and the S_slope attribute.

  • Type_H class has least values for S_Slope attribute.
  • There are outliers in Normal and Type_S class.
  • S_Slope is higher in Normal than Type_H and highest in Type_S.

  • Class and P_radius: Swarm plot, Box plot and Point plot between Class and the P_radius attribute.

  • We can see P_radius value is more for Normal Class.

  • There are some extreme values for Type_S class.

  • All classes have outliers.

  • Class and S_Degree: Swarm plot, Box plot and Point plot between Class and the S_Degree attribute.

  • Extreme Outlier in Type_S.
  • Type_S has higher values for S_degree than the other two classes.
  • There are outliers for Normal and Type_S

  • MultiVariate Analysis

  • P_incidence has a positive relationship with all variables except P_radius. L_angle and S_slope have a stronger relationship with the P_incidence attribute.

  • P_tilt has a stronger relationship with P_incidence and L_angle. There is no relationship with S_slope and P_radius.

  • L_angle has a positive relationship with P_incidence, P_tilt, S_slope and S_degree. It has no relationship with P_radius.

  • S_slope has a positive relationship with P_incidence and L_angle.

  • P_radius has no relationship with S_degree, P_tilt or L_angle. There may be a negative relationship between P_incidence and P_radius.

  • S_degree has no strong positive relationship with any of the variables.

  • It is evident that the Type_S class occurs more often than the other two.

  • Normal class has higher values compared to Type_H.

  • From the diagonal we can see that the distribution of the three Classes is not the same.

  • We can see that Type_S contains higher values.

  • We can see that S_slope and P_incidence have high correlation.

  • P_incidence and L_angle have a correlation of 0.72.

  • S_Degree and P_incidence have a correlation of 0.64.

4. Data pre-processing:
    • Segregate predictors vs target attributes
    • Perform normalisation or scaling if required.
    • Check for target balancing. Add your comments.
    • Perform train-test split.

    Segregating Predictor and Target attributes.

    Perform normalisation or scaling if required.

    Outlier Analysis

    Scaling

    We can see that all the columns have a standard deviation of around 1.
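A sketch of the scaling step with scikit-learn's `StandardScaler` (the matrix below is a toy stand-in for the six predictors):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix standing in for the six biomechanics predictors.
X = np.array([[63.0, 22.5],
              [39.1, 10.1],
              [68.8, 22.2],
              [49.7,  9.7]])

# Standardise each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # [1. 1.]
```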

    Check for target balancing. Add your comments.

    As we know: 0 = Type_H, 1 = Normal, 2 = Type_S.

    As we can see, the target variable is imbalanced: Type_S (2) makes up 48.4% of the data. This can lead to the model not learning the less-represented classes, which results in poor performance on unseen data.

    Perform train-test split.
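The split can be sketched as follows on toy stand-in data; the `stratify` argument is an assumption here, used to preserve the class mix in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy predictors and a 3-class target standing in for the scaled data and Class.
X = np.arange(30).reshape(15, 2)
y = np.array([0, 1, 2] * 5)

# Stratify on y so each class keeps its share in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)  # (12, 2) (3, 2)
```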

    5. Model training, testing and tuning:
    • Design and train a KNN classifier.
    • Display the classification accuracies for train and test data.
    • Display and explain the classification report in detail.
    • Automate the task of finding best values of K for KNN.
    • Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.

    Design and train a KNN classifier.

    Display the classification accuracies for train and test data.

    Training accuracy is 92% and testing accuracy is 80%. Performance is lower on test data, which indicates overfitting.
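A minimal sketch of the KNN fit and the train/test accuracy check, on synthetic stand-in data (the real features and labels are the scaled biomechanics columns and Class):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the six scaled biomechanics features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7,
                                          stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)  # default k, as in a first fit
knn.fit(X_tr, y_tr)
print("train accuracy:", round(knn.score(X_tr, y_tr), 3))
print("test accuracy :", round(knn.score(X_te, y_te), 3))
```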

    Display and explain the classification report in detail.

  • Precision – Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.
    As we can see this model has high precision for Type_S class and low precision for Type_H.

  • Recall – Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.
    We can see that recall is also high for Type_S. There is low recall for Normal class.

  • F1 score – The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0.
    Similarly, the F1 score is high for Type_S and low for both the Normal and Type_H classes.

  • The KNN Classifier above has an accuracy of around 81%.

  • Automate the task of finding best values of K for KNN.

    Training accuracy decreases as we increase the value of K.

    We can see that the maximum test accuracy occurs when k is less than 20, so we will restrict k to values below 20.
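The search over k can be sketched like this (synthetic stand-in data; the report's own search on the real data settled on k = 13):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

# Fit one model per candidate (odd) k and record its test accuracy.
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
          for k in range(1, 20, 2)}
best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k], 3))
```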

    Building Model with K = 13

    Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.

    Logistic Regression

    Naive Bayes

    Support Vector Machines

    From the above results we can see that the SVM with the poly kernel gives the highest accuracy on test data, i.e. 86%.
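A sketch of the kernel comparison with scikit-learn's `SVC`, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

# One SVC per kernel; compare test accuracy to pick the best-performing model.
accs = {k: SVC(kernel=k).fit(X_tr, y_tr).score(X_te, y_te)
        for k in ("linear", "poly", "rbf", "sigmoid")}
for kernel, acc in accs.items():
    print(f"{kernel:8s} {acc:.3f}")
```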

    6. Conclusion and improvisation:

    Write your conclusion on the results.
    From the results of the various tuning techniques we can see that the SVM with the poly kernel gives the best training and test accuracies; it performs well on both sets. The SVC with the sigmoid kernel gives the lowest results on both training and testing sets.

    Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points collected by the research team to perform a better data analysis in future.

  • There is class imbalance in the given data, which may result in incorrect predictions due to bias towards the majority class's characteristics.
  • A clear description of each variable, or just some basic information about its impact, would help in understanding the problem statement better.
  • The data quality is good, but more informative parameters would give better results.

Part 2 - Project Based

    DOMAIN: Banking and finance


    CONTEXT: Bank X is undergoing a massive digital transformation across all its departments. The bank has a growing customer base, the majority of whom are liability customers (depositors) rather than borrowers (asset customers). The bank is interested in rapidly expanding its borrower base to bring in more business via loan interest. A campaign that the bank ran last quarter showed an average single-digit conversion rate. With digital transformation at the core of the business strategy, the marketing department wants to devise more effective campaigns with better-targeted marketing to increase the conversion ratio to double digits on the same budget as the last campaign.

    DATA DESCRIPTION: The data consists of the following attributes:

    1. ID: Customer ID
    2. Age: Customer’s approximate age.
    3. CustomerSince: Customer of the bank since. [unit is masked]
    4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
    5. ZipCode: Customer’s zip code.
    6. HiddenScore: A score associated to the customer which is masked by the bank as an IP.
    7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
    8. Level: A level associated to the customer which is masked by the bank as an IP.
    9. Mortgage: Customer’s mortgage. [unit is masked]
    10. Security: Customer’s security asset with the bank. [unit is masked]
    11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
    12. InternetBanking: if the customer uses internet banking.
    13. CreditCard: if the customer uses bank’s credit card.
    14. LoanOnCard: if the customer has a loan on credit card.

  PROJECT OBJECTIVE: Build an AIML model to perform focused marketing by predicting the potential customers who will convert using the historical dataset.

  1. Import and warehouse data:
    • Import all the given datasets and explore shape and size of each.
    • Merge all datasets onto one and explore final shape and size.

    Import the datasets

    Shape of Datasets

    Size of dataset

    Columns in the given separate datasets.

    Checking Sample Records for individual dataset

    We can see that most of the records for the second dataset are 0.

    Merging the datasets into a combined dataset

    2. Data cleansing:
    • Explore and if required correct the datatypes of each attribute
    • Explore for null values in the attributes and if required drop or impute values.

    Exploring the datatypes

  • We can see that all the columns have quantitative datatypes, i.e. int64 or float64.

  • We should change the datatype of some columns to category, as they are categorical; this will help us in building the model later.

  • The ZipCode column is treated as numerical whereas it is a categorical variable, hence we change its datatype in the dataset.
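The dtype change can be sketched as (`df` stands in for the merged banking dataset):

```python
import pandas as pd

# Toy frame standing in for the merged banking dataset.
df = pd.DataFrame({"ZipCode": [91107, 94720, 91107], "Age": [25, 40, 33]})

# ZipCode is an identifier, not a quantity, so store it as a category.
df["ZipCode"] = df["ZipCode"].astype("category")
print(df.dtypes["ZipCode"])  # category
```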

  • We can see that there are null values in the LoanOnCard column.

    Explore for null values in the attributes and if required drop or impute values.

    There are 20 null values in the LoanOnCard column; since this is a relatively small number compared to the size of the entire dataset, we can drop these rows.

    Also, the column ID has no relevance for model building, hence we can drop this column.
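The two drops can be sketched as follows (toy frame; on the real data the report drops 20 rows and the ID column):

```python
import numpy as np
import pandas as pd

# Toy frame; in the report only 20 of ~5000 LoanOnCard values are missing.
df = pd.DataFrame({"ID": [1, 2, 3],
                   "Age": [25, 40, 33],
                   "LoanOnCard": [0.0, np.nan, 1.0]})

df = df.dropna(subset=["LoanOnCard"])  # drop the few rows with a missing target
df = df.drop(columns=["ID"])           # the identifier carries no signal
print(df.shape)  # (2, 2)
```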

    3. Data analysis & visualisation:
    • Perform detailed statistical analysis on the data.
    • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

  • We can see that the mean and median are almost equal for all columns except Mortgage.

  • The median is zero for the Mortgage column while the mean is around 56, so there will be positive skewness.

  • The standard deviation of the HighestSpend column is quite high. Also, 75% of values are less than 98 while the maximum value is 224, so there are outliers.

  • MonthlyAverageSpend has 75% of values around 2.5 while the maximum value is 10, so there are outliers there as well.

  • We can see high fluctuation in the Mortgage column: 50% of the data has zero values but the maximum value is 635. This column is hugely affected by outliers.

  • Univariate Analysis

    Distribution and outlier analysis of numerical variables

    Age

  • We can see that the distribution appears to be normal, with a slightly wide center.
  • There are no outliers in this column.

  • CustomerSince

  • We can see that there are negative values in this column, which makes no sense in this banking scenario.

  • Most of the Customers joined between 5 and 40 (units are masked).

  • There are no outliers in this column.

  • HighestSpend

  • We can see positive skewness in the data. For a single transaction, the highest spend amount is mostly between 45 and 100.

  • There are 96 outliers in this column.
  • A few customers have spent over 180 (units are masked).

  • MonthlyAverageSpend

  • There is a lot of positive skewness in this data.

  • Monthly average spend of the customers is mostly between 1 and 3 (units are masked).

  • There are a lot of outliers in this column, conveying that some customers spend a lot too.

  • Mortgage

  • The maximum number of cases have a value of 0, telling us that most customers have no mortgage.

  • There is no proper distribution for this column.

  • There are many outliers in this column.

  • Distribution of categorical variables

    HiddenScore

  • We can see that all four categories have an almost equal distribution.

  • Category 1 has the most customers and category 3 has the fewest.

  • Level

  • Level 1 has the maximum customers.

  • Level 2 and Level 3 have an almost equal distribution.

  • Security

  • We can see that the bank does not hold security assets for 89.6% of customers.

  • The bank holds security assets for only 10.4% of customers. This is very risky.

  • FixedDepositAccount

  • We can see that 94% of customers don't have a FixedDepositAccount with the bank.

  • InternetBanking

  • 60% of customers use internet banking and 40% don't.

  • CreditCard

  • 70% of customers use the bank's credit card.

  • LoanOnCard (Target Variable)

  • We can see from these graphs that there is a lot of target imbalance.

  • 90% of customers don't have a loan on card and 9.6% do.

  • Bi-variate Analysis

    LoanOnCard vs CustomerSince

    LoanOnCard vs Age

    LoanOnCard vs HighestSpend

  • It is evident that people who have a loan on card spend much more than those who don't have any loan.

  • Also, some people who don't have a loan on card tend to spend a lot, as indicated by the outliers on the boxplot.

  • MonthlyAverageSpend vs LoanOnCard

  • Monthly average spend is higher for LoanOnCard = 1.

  • Some people without the loan also tend to spend a lot, hence the outliers.

  • The mean spend is significantly higher for people with LoanOnCard = 1.

  • LoanOnCard vs Mortgage

  • Mortgage is high for LoanOnCard = 1
  • There are extreme values in both the cases.
  • Mortgage Mean values are more for loan holders

  • MonthlyAverageSpend vs HighestSpend

    We can see that there is clearly a linear relationship between these two variables.

    Correlation Between Variables

  • Clearly Age and CustomerSince are almost entirely correlated, so we'll only use one of these for model building.
  • Also, there is high correlation between HighestSpend and MonthlyAverageSpend.

  • Hidden Score vs Loan on card

    We can see that the count for LoanOnCard = 0 is high for all the HiddenScore levels; this is because of class imbalance, which we will have to address before building the model.

    Multivariate Analysis

  • Clear linear Relationship between Age and CustomerSince.
  • Severe class Imbalance in LoanOnCard

  4. Data pre-processing:
    • Segregate predictors vs target attributes
    • Check for target balancing and fix it if found imbalanced.
    • Perform train-test split.

    Segregating Predictor and Target attributes.

    We don't need ZipCode, and due to the high correlation between Age and CustomerSince, we drop these columns from X, i.e. the predictor variables.

    Checking if all other variables have an impact on LoanOncard using hypothesis testing.

    We can see that Security, InternetBanking and CreditCard do not show a significant difference across the target variable, so we drop these columns before building the model.
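The report does not show which test was used; one common choice for a categorical predictor against a binary target is the chi-square test of independence, sketched here on hypothetical counts:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical sample; in practice the test runs on the full dataset.
df = pd.DataFrame({"CreditCard": [1, 0, 1, 0, 1, 0, 1, 0],
                   "LoanOnCard": [1, 1, 0, 0, 1, 0, 0, 1]})

# Chi-square test of independence: a large p-value means the predictor shows
# no significant difference across the target and is a candidate for dropping.
table = pd.crosstab(df["CreditCard"], df["LoanOnCard"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"p-value = {p:.3f}")
```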

    Outlier Analysis

    Check for target balancing and fix it if found imbalanced.

  • As we can see, there is a huge target imbalance in the dataset.
  • We will fix this with the help of SMOTE, since the Near Miss algorithm is not suitable for such heavy imbalance.
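The report balances the classes with SMOTE (from the imbalanced-learn package, which synthesizes new minority samples by interpolating between neighbours). As a dependency-free illustration of the rebalancing idea, here is a simple random-oversampling sketch using scikit-learn's `resample`:

```python
import numpy as np
from sklearn.utils import resample

# Toy data with roughly the 90/10 LoanOnCard split from the report.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)

# Upsample the minority class (with replacement) to the majority count.
X_up, y_up = resample(X[y == 1], y[y == 1], n_samples=90, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # [90 90]
```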

  • Perform train-test split.

    For Balanced Data

    For imbalanced Data.

    5. Model training, testing and tuning:
    • Design and train a Logistic regression and Naive Bayes classifiers.
    • Display the classification accuracies for train and test data.
    • Display and explain the classification report in detail.
    • Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.

    Logistic Regression

    For SMOTE balanced data

    Design and train a Logistic Regression classifier.

    Display the classification accuracies for train and test data.

    Display and explain the classification report in detail.

  • We can see that the accuracy score is 90% for the model.

  • For LoanOnCard = 1 the metrics are pretty close to those for LoanOnCard = 0.

  • As we can see, this model has a higher precision for 0 and a lower precision for 1.
  • Recall, on the other hand, is high for 1 and low for 0.

  • F1 score – The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0.
  • The F1 score is 0.90 for both 0 and 1.
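A sketch of the fit and the classification report on synthetic, roughly balanced stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic balanced binary data standing in for the SMOTE-balanced predictors.
X, y = make_classification(n_samples=400, n_features=5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = logreg.score(X_te, y_te)
print("test accuracy:", round(acc, 3))
print(classification_report(y_te, logreg.predict(X_te)))  # per-class P/R/F1
```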

  • For Imbalanced Data

    Design and train a Logistic Regression classifier.

    Display the classification accuracies for train and test data.

    Display and explain the classification report in detail.

  • The overall accuracy of this model is higher than the one trained on the balanced data, but the model's real performance is lower due to the target imbalance.

  • The recall and F1 scores are very low for the positive LoanOnCard class.

  • Naive Bayes Model

    For SMOTE balanced data

    Design and train a Naive Bayes classifier.

    Display the classification accuracies for train and test data.

    Display and explain the classification report in detail.

  • We can see that the accuracy score is 84% for the model.

  • The metrics are close for both the classes.

  • As we can see, this model has a higher precision for 1 and a lower precision for 0.
  • Recall is high for 0 and low for 1.

  • The F1 score is 0.85 and 0.83 for 0 and 1 respectively.

  • For Imbalanced Data

    Design and train a Naive Bayes classifier.

    Display the classification accuracies for train and test data.

    Display and explain the classification report in detail.

  • We can see that the accuracy score is 88% for the model.

  • The metrics are not at all close for the two classes; this is because of the class imbalance in the given dataset.

  • As we can see, this model has higher precision, recall and F1-score values for 1 and lower values for 0.

  • Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.

    K-Fold CV for finding best model

    For imbalanced Data

    For balanced Data
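The K-fold comparison can be sketched as follows (synthetic stand-in data; 5 folds is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary data standing in for the banking predictors and target.
X, y = make_classification(n_samples=400, n_features=5, random_state=3)

# Mean 5-fold cross-validated accuracy for each candidate model.
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                               ("GaussianNB", GaussianNB())]}
for name, score in results.items():
    print(f"{name:18s} {score:.3f}")
```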

    6. Conclusion and improvisation:


    Write your conclusion on the results

  • We can see from the results that the Logistic Regression Model gives better performance than the other models.

  • The target imbalance leads to biased results.

  • The Naive Bayes Classifier does not perform very well for this dataset.

    Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points collected by the bank to perform a better data analysis in future.

  • The data should be collected such that the variables have more even distributions, or the imbalance should be reduced where possible.

  • Masking the units is fine, but some scale should be given so that one can better understand how a parameter will impact the model building.

  • Instead of the ZipCode, classifying the location as rural, urban, semi-urban etc. might help in finding a better customer base, given the easier access to card usage in urban areas.

  • As we can see, for this dataset the veracity is not great; there were many columns which had no impact on the model.

  • There are negative values in the CustomerSince column, which have no relevance at all.

  • The data description is not clear for the columns Security and FixedDepositAccount; it says they are values, but from the data they look like categorical variables, and hence the models built above treat them as categories.